Abstract

This paper considers parallel Gröbner bases
algorithms on distributed memory parallel computers
with multi-core compute nodes. We summarize three
different Gröbner bases implementations: shared
memory parallel, pure distributed memory parallel
and distributed memory combined with shared memory
parallelism. The last algorithm, called distributed
hybrid, uses only one control communication channel
between the master node and the worker nodes and
keeps polynomials in shared memory on a node. The
polynomials are transported asynchronously to the
control flow of the algorithm in a separate
distributed data structure. The implementation is
generic and works for all implemented (exact)
fields. We present new performance measurements and
discuss the performance of the algorithms.

1 Introduction 

We summarize parallel algorithms for computing
Gröbner bases on today's cluster parallel computing
environments in a Java computer algebra system
(JAS), which we have developed in the last years
[23 [29] . Our target hardware consists of
distributed memory parallel computers with
multi-core compute nodes. Such computing
infrastructure is predominant in today's high
performance computing (HPC) clusters. The
implementation of Gröbner bases algorithms is one
of the essential building blocks for any
computation in algebraic geometry. Our aim is an
implementation in a modern object oriented
programming language with generic data types, as
provided by the Java programming language.



Besides the sequential algorithm, we consider three
Gröbner bases implementations: multiple threads
using shared memory, pure distributed memory with
communication of polynomials between compute nodes,
and distributed memory combined with multiple
threads on the nodes. The last algorithm, called
distributed hybrid, uses only one control
communication channel between the master node and
the worker nodes and keeps polynomials in shared
memory on a node. The polynomials are transported
asynchronously to the control flow of the algorithm
in a separate distributed data structure.
In this paper we present new performance measure- 
ments on a grid-cluster [B] and discuss performance 
of the algorithms. 

An object oriented design of a Java computer al- 
gebra system (called JAS) as type safe and thread 
safe approach to computer algebra is presented in 
[SI [H US]. JAS provides a well designed 
software library using generic types for algebraic 
computations implemented in the Java programming
language. The library can be used like any other
Java software package, or it can be used
interactively or interpreted through a Jython (Java
Python) front-end. The focus of JAS is at the
moment on commutative and solvable polynomials,
Gröbner bases, greatest common divisors and
applications. JAS contains interfaces and classes for
basic arithmetic of integers, rational numbers and 
multivariate polynomials with integer or rational 
number coefficients. 

1.1 Parallel Gröbner bases

The computation of Gröbner bases (via the
Buchberger algorithm) solves an important problem
for computer algebra [5]. These bases play the same
role for the solution of systems of algebraic
equations as the LU-decomposition, obtained by
Gaussian elimination, for systems of linear
equations. Unfortunately the computation of such
polynomial bases is notoriously hard, both with
sequential and parallel algorithms. So any
improvement of this algorithm is of great
importance. For a discussion of the problems with
parallel versions of this algorithm, see the
introduction in [28].

1.2 Related work 

In this section, we briefly summarize the related 
work. Related work on computer algebra libraries 
and an evaluation of the JAS library in comparison 
to other systems can be found in [2H H5J . 

Theoretical studies on parallel computer algebra 
focus on parallel factoring and problems which can 
exploit parallel linear algebra [321 [IT]. Most re- 
ports on experiences and results of parallel com- 
puter algebra are from systems written from scratch 
or where the system source code was available. A
newer approach, a multi-threaded polynomial library
implemented in C, is for example [9]. Among the
commercial systems, there are reports on Maple [5]
(workstation clusters) and Reduce [35] (automatic
compilation, vector processors). Multiprocessing
support for Aldor (the programming language of
Axiom) is presented in [36]. Grid aware computer
algebra systems are for example [55]. The SCIEnce
project works on Grid facilities for the symbolic
computation systems GAP, KANT, Maple and MuPAD
[40] [44]. Java grid middle-ware
systems and parallel computing platforms are pre- 
sented in [TBI HH El HOI HI H] • For further overviews 
see section 2.18 in the report [TH] and the tutorial 

For the parallel Buchberger algorithm the idea of
parallel reduction of S-polynomials seems to have
originated with Buchberger and was folklore for
a while. First implementations have been reported,
for example, by Hawley [T7] and others [331 SSI EI] • 
For triangular systems multi-threaded parallel ap- 
proaches have been reported by [34l [37] . 

1.3 Outline 

Due to limited space we must assume that the reader
is familiar with Java, object oriented programming
and the mathematics of Gröbner bases [5]. Section 2
introduces the expected and developed
infrastructure to implement parallel and
distributed Gröbner bases. The Gröbner base
algorithms are summarized in section 3. Section 4
evaluates several aspects of the design, namely
termination detection, performance, the 'workload
paradox', and selection strategies. Finally section
5 draws some conclusions and shows possible future
work.

For convenience, this paper contains summaries and
revised parts of [551 I2S] to explain the new
performance measurements. Performance figures and
tables are presented throughout the paper;
explanations are in section 4.2.



2 Hard- and middle-ware 

In this section we summarize computing hardware 
and middle-ware components required for the im- 
plementation of the presented algorithms. The 
suitability of the Java computing platform for par- 
allel computer algebra has been discussed for ex- 
ample in [251 125]. 

2.1 Hardware 

Common grid computing infrastructure consists of 
nodes of multi-core CPUs connected by a
high-performance network. We have access to the
bwGRiD infrastructure [6]. It consists of 2 × 140
8-core CPU nodes at 2.83 GHz with 16 GB main memory
connected by a 10 Gbit InfiniBand and 1 Gbit Eth- 
ernet network. The operating system is Scientific 
Linux 5.0 and has shared Lustre home directories 
and a PBS batch system with Maui scheduler. 

The performance of the distributed algorithms
depends on the fast InfiniBand networking hardware.
We have also done performance tests with normal
Ethernet networking hardware. The Ethernet
connection of the nodes in the bwGRiD cluster is
1 Gbit to a switch in a blade center containing
14 nodes and a 1 Gbit connection of the blade
centers to a central switch. We could not obtain
any speedup for the distributed algorithm on an
Ethernet connection; a speedup can only be reported
with the InfiniBand connection.

The InfiniBand connection is used with the TCP/IP
protocol. Support for the direct InfiniBand
protocol, by-passing the TCP/IP stack, will
eventually be available in JDK 1.7 in 2010/11. The
evaluation of the direct InfiniBand access will be
future work.

2.2 Execution middle-ware

In this section we summarize the execution and
communication middle-ware used in our
implementation, for details see [55]. The execution
middle-ware is general purpose and independent of a
particular application like Gröbner bases and can
thus be used for many kinds of algebraic
algorithms.



Figure 1: Middleware overview. The diagram shows
the master node and a client node with the
components GBDist, Distributed Thread Pool,
ExecutableServer, Reducer Server, Reducer Client,
Distributed Thread, clientPart(), DHT Client and
DHT Server.

The infrastructure for the distributed partner
processes uses a daemon process, which has to be
set up via the normal cluster computing tools or
some other means. The cluster tools available at
Mannheim use PBS (portable batch system). PBS
maintains a list of nodes allocated for a cluster
job, which is used in a loop with ssh-calls to
start the daemon on the available compute nodes.
The lowest level class ExecutableServer implements
the daemon processes, see figure 1. They receive
serialized instances of classes which implement the
RemoteExecutable interface and execute them (call
their run() method). On top of the low level
daemons is a thread pool infrastructure, which
distributes jobs to the remote daemons, see classes
DistThreadPool and DistPoolThread in figure 1.
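
The daemon mechanism can be illustrated with a
minimal sketch. It is not the JAS code: the names
RemoteTask, ShippedTask and DaemonSketch are
hypothetical stand-ins, and a byte array replaces
the TCP/IP socket. It only shows the idea of
serializing a runnable object on one side and
deserializing and running it on the other.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for an interface like RemoteExecutable. */
interface RemoteTask extends Runnable, Serializable {
}

/** A task that only records that it was executed (assumption for the demo). */
class ShippedTask implements RemoteTask {
    private static final long serialVersionUID = 1L;
    static volatile String lastRun; // set by run(), lets the caller observe execution
    private final String description;

    ShippedTask(String description) {
        this.description = description;
    }

    @Override
    public void run() {
        lastRun = description;
    }
}

class DaemonSketch {
    /** Sender side: serialize the task; the bytes would go over the socket. */
    static byte[] ship(RemoteTask task) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(task);
        }
        return buffer.toByteArray();
    }

    /** Daemon side: deserialize the received object and call its run() method. */
    static void receiveAndRun(byte[] wire) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(wire))) {
            RemoteTask task = (RemoteTask) in.readObject();
            task.run();
        }
    }
}
```

In a real deployment the daemon must have the task
classes on its class path, since Java serialization
transports object state but not byte code.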

The communication infrastructure is provided on top
of TCP/IP sockets with Java object serialization.
In case of the distributed hybrid algorithm we have
only one TCP/IP connection (for control) between
the master and the remote threads. To be able to
distinguish messages between specific threads on
both sides we use tagged message channels. Each
message is sent together with a tag (a unique
identifier) and the receiving side can then wait
only for messages with specific tags. For
details on the implementation and alternatives see 



2.3 Data structure middle-ware 

We try to reduce communication cost by employing a
distributed data structure with asynchronous
communication which can be overlapped with
computation. Using marshalled objects for
transport, the object serialization overhead is
minimized. This data structure middle-ware is
independent of a particular application like
Gröbner bases and can be used for many kinds of
applications.

The distributed data structure is implemented by
class DistHashTable, called 'DHT client' in figure
1, see also figure 7. It implements the Map
interface and extends AbstractMap from the
java.util package with type parameters and can thus
be used in a type safe way. In the current program
version we use a distributed hash table with
centralized control; a decentralized version will
be future work. For the usage of the data structure
the clients only need the network node name and
port of the master. In addition there are methods
like getWait(), which expose the different
semantics of a distributed data structure, as it
blocks until an element for the key has arrived.
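
The blocking semantics of a method like getWait()
can be sketched with a plain Java monitor. This is
an illustration under assumptions, not the
DistHashTable implementation: the class WaitingMap
is hypothetical, it is purely local, and the
network transport that would fill the table is left
out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical local stand-in for a table whose getWait() blocks until a key arrives. */
class WaitingMap<K, V> {
    private final Map<K, V> store = new HashMap<>();

    /** Store a value and wake up all threads blocked in getWait(). */
    public synchronized void put(K key, V value) {
        store.put(key, value);
        notifyAll();
    }

    /** Non-blocking lookup: returns null if the key has not arrived yet. */
    public synchronized V get(K key) {
        return store.get(key);
    }

    /** Blocking lookup: waits until some thread has put a value for the key. */
    public synchronized V getWait(K key) throws InterruptedException {
        while (!store.containsKey(key)) {
            wait(); // releases the monitor until put() calls notifyAll()
        }
        return store.get(key);
    }
}
```

A reducer that receives only the index of a
polynomial could call getWait(index) and would then
be suspended until the asynchronous update carrying
that polynomial has been delivered.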



3 Gröbner bases

In this section we summarize the sequential, the
shared memory parallel and distributed versions of
algorithms to compute Gröbner bases as described
in [23 [29] . For the mathematics of the sequential
version of the Buchberger algorithm see [5] or other
books.

3.1 Sequential Gröbner bases

The sequential algorithm takes a set of
(multivariate) polynomials over a field as input
and produces a new set of polynomials which
generates the same polynomial ideal but
additionally the reduction relation with respect to
the new set of polynomials has unique normal forms.
The implementation is generic and works for all
(exact) fields implemented in JAS and also handles
the case of modules over polynomial rings. In the
algorithm, first a set of critical pairs is
generated, then the S-polynomial of each critical
pair is checked to see whether it can be reduced to
zero. If not, the resulting remainder of the
reduction is added to the set of polynomials and
new critical pairs are generated. The algorithm
terminates when all S-polynomials of critical pairs
reduce to zero (which is guaranteed to happen by
Dickson's lemma). The implementation only uses
Buchberger's first and second criterion (see [5]).
Optimizations like the F4 or F5 algorithm
[TJ EH EH EE] are not incorporated. In this paper
we focus on the comparison of the 'simple'
sequential Buchberger algorithm to 'simple'
parallel and distributed algorithms without
interference with further optimizations. Optimized
algorithms will be studied and compared in future
work.

GroebnerBase (interface)
  + isGB(F : List<GenPolynomial>) : boolean
  + GB(F : List<GenPolynomial>) : List<GenPolynomial>
  + extGB(F : List<GenPolynomial>) : ExtendedGB
  + minimalGB(G : List<GenPolynomial>) : List<GenPolynomial>

Reduction
  + normalform(F : List<GenPolynomial>, p : GenPolynomial) : GenPolynomial

GroebnerBaseAbstract
  + GroebnerBaseAbstract(red : Reduction)
  + isGB(F : List<GenPolynomial>) : boolean
  + isGB(modv : int, F : List<GenPolynomial>) : boolean
  + GB(F : List<GenPolynomial>) : List<GenPolynomial>
  + GB(modv : int, F : List<GenPolynomial>) : List<GenPolynomial>
  + extGB(F : List<GenPolynomial>) : ExtendedGB
  + extGB(modv : int, F : List<GenPolynomial>) : ExtendedGB
  + minimalGB(G : List<GenPolynomial>) : List<GenPolynomial>

GroebnerBaseSeq
  + GroebnerBaseSeq(red : Reduction)
  + GB(modv : int, F : List<GenPolynomial>) : List<GenPolynomial>

GroebnerBaseParallel
  + GroebnerBaseParallel(threads : int, red : Reduction)
  + GB(modv : int, F : List<GenPolynomial>) : List<GenPolynomial>

GroebnerBaseDistributed
  + GroebnerBaseDistributed(threads : int, red : Reduction, port : int)
  + GB(modv : int, F : List<GenPolynomial>) : List<GenPolynomial>

GroebnerBaseDistributedHybrid
  + GroebnerBaseDistributedHybrid(threads : int, tpernode : int, red : Reduction, port : int)
  + GB(modv : int, F : List<GenPolynomial>)

Figure 2: UML diagram of Gröbner base classes.



The implementation of the parallel and distributed
versions is based on the sequential algorithm.
These algorithms are implemented following standard
object oriented patterns (see figure 2). There is
an interface, called GroebnerBase, which specifies
the desirable functionality. Then there is an
abstract class, called GroebnerBaseAbstract, which
implements as many methods as possible. Finally
there are concrete classes which extend the
abstract class and implement different algorithmic
details. For example GroebnerBaseSeq implements a
sequential, GroebnerBaseParallel implements a
thread parallel, and GroebnerBaseDistributed
implements a network distributed version of the
Gröbner base algorithm as described in [28].
GroebnerBaseDistributedHybrid implements the hybrid
algorithm as described in [29].

The polynomial reduction algorithms are implemented
by methods normalform() in classes ReductionSeq and
ReductionPar. The latter class does not implement a
parallel reduction algorithm, as its name may
suggest, but a sequential algorithm which can
tolerate and use asynchronous updates of the
polynomial list by other threads. A parallel
reduction implementation is still planned for
future work.

3.2 Parallel Gröbner bases

The shared memory parallel Gröbner bases algorithm
is a variant of the classical sequential Buchberger
algorithm and follows our previous work in [22]. It
maintains a shared data structure, called pair
list, for bookkeeping of the computations. This
data structure is implemented by classes
CriticalPairList and OrderedPairList (see section
4.4). Both have synchronized methods put() and
getNext(), respectively removeNext(), to update the
data structure and acquire a pair for reduction. In
this way the pair list is used as work queue in the
parallel and the distributed implementations. As
long as there are idle threads, critical pairs are
taken from the work queue and processed in a
thread. The processing consists of forming
S-polynomials and doing polynomial reductions with
respect to the current list of polynomials. When a
reduction has finished and the result polynomial is
non-zero, new critical pairs are formed and the
polynomial is added to the list of polynomials.
Note that, due to the different computing times
needed for the reduction of polynomials, the
results may finish in a different sequence order
than in the sequential algorithm (see section 4.4).
As the proof of the 'simple' Buchberger algorithm
does not depend on a certain sequence order, the
correctness of the parallel and distributed
algorithms is established.

Figure 3: Parallel performance, example Katsura

Table 1: Parallel timings for figure 3.

3.3 Parallel solvable Gröbner bases

The parallel algorithms are also implemented for
solvable polynomial rings with left, right and
two-sided variants. As in the commutative case the
reductions are performed in parallel by as many
threads as are specified. The right sided Gröbner
base computation is done via the opposite ring and
delegation to the left sided parallel computation.
These variants are not discussed further in this
paper.

Figure 4: Parallel performance, example Cyclic 6.

Table 2: Parallel timings for figure 4.

3.4 Distributed Gröbner bases

We start with the description of the distributed
parallel Gröbner base algorithm; the infrastructure
required to run it has been discussed in the
previous sections. The description summarizes parts
of [28, 29]. Figure 1 gives an overview of the
involved classes and the middle-ware.

The main part of the distributed Gröbner bases
computation uses the same work queue
(CriticalPairList or OrderedPairList) as the
parallel version. From the main driver method GB(),
shared memory threads take pairs from the work
queue and update the critical pair list by reduced
S-polynomials. But now the threads send the pairs
(indexes of pairs) to the distributed partners for
reduction over a network connection and receive the
reduced S-polynomials from the network connection.
Standard Java object serialization is used to
encode polynomials for network transport.

The main method GB() initializes the critical pair
list and the polynomial list. The polynomial list
is added to a distributed list. The list index of a
polynomial is used as a key in the hash table. The
main method then starts the reducer server threads
and waits until they are terminated. The reducer
servers access the critical pair list and send pair
indexes to the remote reducer client daemons.
Received polynomials are recorded, the critical
pair list is updated and termination conditions are
checked. The reducer client daemons receive index
pairs, perform the reduction and send the resulting
polynomial back. Note that only the index of a
polynomial in the distributed list is sent, not the
polynomial itself. Only the reduction result is
sent back once to the master and then sent to the
distributed lists. The list is cached on all
partner processes and the master process maintains
the polynomials and the index numbers. The
reduction is performed by the distributed processes
with the class ReductionPar, which will detect
asynchronous updates of the cached list and restart
the reduction from the beginning in such a case.

Figure 5: Distributed performance, example Katsura 8.

Figure 6: Distributed performance, example Cyclic 6.

Table 3: Distributed timings for figure 5.

Table 4: Distributed timings for figure 6.

To make use of multiple CPU cores on a node one
could start multiple Java virtual machines (JVM) on
it. This approach will however limit the available
memory per JVM, need more TCP/IP connections and
have higher transport communication overhead.
Alternatively, the ExecutableServer infrastructure
is capable of running multiple remote jobs in one
JVM. This avoids multiple JVMs, but the other
drawbacks remain the same. A better solution is
presented in the next section.

3.5 Distributed hybrid Gröbner bases

In the pure distributed algorithm there is one
server thread per client process on a compute node.
In the new hybrid algorithm we have multiple client
threads on a compute node. Looking at figure 7,
this means that for the new algorithm multiple
reducer client threads are created. But there is
still only one reducer server thread per compute
node.

Figure 7: Thread to node mapping.

The communication for the pure algorithm is simple:
a client requests a pair, the server sends a pair,
the client does the reduction and sends the result
back, then it requests a new pair. Since we now
have multiple clients per communication channel
this simple protocol can no longer be used. On the
reducer client side we have to extend the protocol:
request a pair, reduce it, send the result back,
additionally receive an acknowledgment, then
continue to request a new pair. On the server side,
however, the messages will appear in arbitrary
order: pair request messages will be interleaved
with result messages. To distinguish between both
types of messages we augment messages with tags
representing the respective type. The handling of
these tagged messages is implemented in class
TaggedSocketChannel. The serialized objects sent
through those channels are tagged with a unique
identifier, which can then be used to receive only
certain messages. The request type messages are
received in the main method of reducer server
threads and the result type messages are received
independently in a new separate reducer receiver
thread. So for any compute node only one
communication connection with the master is used
by all threads on the node.
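
The effect of tagging can be sketched in memory,
without sockets. The class TaggedMailbox below is a
hypothetical illustration and does not reproduce
the API of TaggedSocketChannel: messages are sorted
into one queue per tag, so that a receiver waiting
for one tag is not disturbed by interleaved
messages carrying another tag.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical in-memory stand-in for a channel that multiplexes tagged messages. */
class TaggedMailbox {
    // one queue per tag; created lazily on first send or receive
    private final ConcurrentHashMap<String, BlockingQueue<Object>> boxes =
            new ConcurrentHashMap<>();

    private BlockingQueue<Object> box(String tag) {
        return boxes.computeIfAbsent(tag, t -> new LinkedBlockingQueue<>());
    }

    /** Deliver a message under the given tag. */
    void send(String tag, Object message) {
        box(tag).add(message);
    }

    /** Block until a message with exactly this tag is available. */
    Object receive(String tag) throws InterruptedException {
        return box(tag).take();
    }
}
```

With such a structure a server thread can wait on a
"request" tag while a separate receiver thread
waits on a "result" tag, even though both kinds of
messages arrive over the same connection in
arbitrary order.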



Figure 8: Distributed hybrid performance, example
Cyclic 6.

Figure 9: Distributed hybrid performance, example
Katsura 8.

4 Evaluation 

In this section we present termination and perfor- 
mance related issues. 

4.1 Termination 

In this section we sketch the termination detection
in the Buchberger algorithm. For details see [25].
As the number of polynomials in the bases changes
and as a consequence the number of critical pairs
changes during the progress of the algorithm, there
is no a-priori way to find out when the algorithm
will terminate. Only the non-constructive proof of
Dickson's lemma guarantees that it will terminate
at all. Termination is implicitly detected when all
critical pairs have been processed.

Table 5: Distributed hybrid timings for figure 8.

Table 6: Distributed hybrid timings for figure 9.

For the sequential algorithm, where we have only
one thread of control, the test whether all
critical pairs have been processed is sufficient
for termination detection. In a multi-threaded
setting this no longer holds. For example, all but
the last thread might find the set of critical
pairs to be empty. However, the last running thread
might produce a non-zero reduction polynomial from
which a cascade of new critical pairs could be
produced. So if multiple threads are used, the
termination condition consists of two parts:

1. the set of critical pairs to process is empty and

2. all threads are idle and not processing a
polynomial.

Both conditions have to be checked in a consistent
way.

The set of critical pairs serves as work queue. Its
access methods are synchronized for concurrent
access but do not block if the set is empty. In
case the set of critical pairs is empty the methods
return null. In the hybrid distributed algorithm a
thread of the master process is responsible for
more than one distributed thread. The processing
sequence is shown in figure 12. Condition 2 is
ensured by the atomic idle-count, see also figure 7.
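
The two-part termination condition can be sketched
as follows. This is a simplified illustration under
assumptions, not the JAS code: the classes
WorkQueue and TerminationDemo are hypothetical,
integers stand in for critical pairs, and instead
of an atomic idle-count a counter of working
threads is kept under the same lock as the queue,
so that both conditions are checked consistently.
Processing an item n > 0 puts two new items n - 1,
which mimics a reduction that creates new pairs.

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical work queue: non-blocking removeNext() plus a count of working threads. */
class WorkQueue {
    private final ArrayDeque<Integer> items = new ArrayDeque<>();
    private int working = 0; // threads that hold an item and may still put new ones

    synchronized void put(int item) {
        items.add(item);
    }

    /** Returns the next item and marks the caller as working, or null if empty. */
    synchronized Integer removeNext() {
        Integer item = items.poll();
        if (item != null) {
            working++;
        }
        return item;
    }

    /** Called by a worker after it has put all follow-up items. */
    synchronized void finished() {
        working--;
    }

    /** Condition 1 (queue empty) and condition 2 (nobody working) under one lock. */
    synchronized boolean isTerminated() {
        return items.isEmpty() && working == 0;
    }
}

class TerminationDemo {
    /** Runs the given number of threads and returns how many items were processed. */
    static int run(int threads, int start) throws InterruptedException {
        WorkQueue queue = new WorkQueue();
        AtomicInteger processed = new AtomicInteger();
        queue.put(start);
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                while (!queue.isTerminated()) {
                    Integer n = queue.removeNext();
                    if (n == null) { // empty, but another thread may still add work
                        Thread.yield();
                        continue;
                    }
                    processed.incrementAndGet();
                    if (n > 0) { // a "non-zero result" creates new work
                        queue.put(n - 1);
                        queue.put(n - 1);
                    }
                    queue.finished();
                }
            });
            pool[t].start();
        }
        for (Thread thread : pool) {
            thread.join();
        }
        return processed.get();
    }
}
```

Checking only for an empty queue would let idle
threads exit while another thread is still about to
add items; because new items are only added by
threads counted as working, the combined test
cannot report termination too early.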

4.2 Performance 

The measurements shown in this paper have all been
taken on the hardware described in section 2.1 and
with JDK 1.6.0 with 64-bit server JVM, running with
9-13 GB memory, and with JAS version 2.3, revision
2988. The examples can be found in [14]. The
figures show the computing time in seconds for a
given number of threads or nodes, or nodes with
threads per node. In the 2-d plots we also show the
speedup. The corresponding tables show the number
of nodes, the threads/processes per node (ppn), the
computing time in milliseconds and the speedup. The
last two columns show the number of polynomials put
to the critical pair list (put) and the number of
pairs removed from the critical pair list (rem)
after application of the criteria to avoid
unnecessary reductions. The timings for the
sequential algorithm are included with nodes and
threads in the figures and tables.

Figure 10: Distributed hybrid performance, example
Katsura modulo $2^{127} - 1$.

Figure 11: Distributed hybrid performance, example
Katsura modulo $2^{3217} - 1$.

Figure 12: Termination of the hybrid GB algorithm.

To better study the influence of the transport
overhead, the master node is always separated and
not counted. The coefficient rings in the examples
are the rational numbers (using rational arithmetic
in coefficients), or modular numbers, if a modulus
is shown. As moduli we use the 12th Mersenne prime,
$2^{127} - 1$, with 39 digits and the 18th Mersenne
prime, $2^{3217} - 1$, with 969 digits. The case of
rational coefficients using fraction free integer
arithmetic with taking primitive parts of
polynomials remains to be studied.
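
The sizes of the two moduli can be checked with a
few lines of standard Java. The class name
MersenneModuli is only chosen for this
illustration; the sketch constructs the numbers
with java.math.BigInteger and counts their decimal
digits, and is independent of how JAS represents
modular coefficients.

```java
import java.math.BigInteger;

class MersenneModuli {
    /** Returns 2^p - 1 as an arbitrary precision integer. */
    static BigInteger mersenne(int p) {
        return BigInteger.ONE.shiftLeft(p).subtract(BigInteger.ONE);
    }

    /** Number of decimal digits of 2^p - 1. */
    static int digits(int p) {
        return mersenne(p).toString().length();
    }

    public static void main(String[] args) {
        for (int p : new int[] {127, 3217}) {
            System.out.println("2^" + p + " - 1 has " + digits(p) + " decimal digits");
        }
    }
}
```

The program reports 39 and 969 digits, in agreement
with the values given above. Coefficients of this
size make each arithmetic operation considerably
more expensive than with small moduli.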

Figures 3 and 4 show timings and speedup for the
parallel shared memory version of the algorithms.
We achieve a speedup of 5 to 6 using 6 or 7 CPUs.
This is quite reasonable, as we run with 2 garbage
collection threads, which interfere with the
computation when all CPUs are occupied. Figure 4
shows a speedup of 143 for 3 threads by some luck.
Only 177 polynomials are added in this case to the
intermediate ideal bases instead of about 300 in
the other runs. One could of course also experience
bad luck and hit particularly long intermediate
ideal bases. See also the next section 4.3.

Timings and speedup for the (pure) distributed
algorithm are shown in figures 5 and 6. Figure 6
shows a well behaving example with some speedup up
to 5 nodes and an extra speedup for 3 nodes. This
time, however, the number of intermediate
polynomials is high (318) but the number of
critical pairs remaining after application of the
criteria is only 750. The example of figure 5 shows
bad speedup and an extra speed-down for 7 nodes.

Figures 8, 9, 10 and 11 show timings for the
distributed hybrid algorithms. Example Cyclic 6 is
shown in figure 8. We see some extra high speedup
of 269 for 2 nodes and 6 threads per node, and of
50 for 5 nodes and 2 threads per node. However,
there is also a speed-down of 0.85 for 2 nodes and
5 threads per node. Example Katsura 8 in figure 9
shows speedups of 7 for 4 nodes with 2 threads per
node and 6 nodes with 2 threads per node. A
speed-down of 0.57 is observed for 4 nodes with 3
threads per node. Example Katsura 8 with modular
arithmetic is shown in figures 10 and 11. The
example with a 969-digit modulus shows very smooth
timings and a predictable, reasonable speedup of
about 20 on 5 nodes using 40 CPUs, although the
absolute computing times are high. For the 39-digit
modulus we only see a speedup of 7 for 2 nodes with
5 threads per node and no particularly bad
speed-down. For an even larger modulus with 6002
digits, the Mersenne prime $2^{19937} - 1$, the
smooth timings are lost and more unpredictable
timings return (no figure for this case). A closer
look at tables 5, 6 and 7 shows an overhead between
150 and 300 seconds for the distributed hybrid
version compared to the sequential version.

In summary we see that the parallel and distributed
hybrid algorithms perform well. The (pure)
distributed algorithm does not perform particularly
well. This indicates that for optimal performance
we need to use as many shared memory CPUs as
feasible. For 8 CPUs on a node it is fast for up to
6 threads. Since we use 2 garbage collection
threads on a node, this is quite reasonable. The
communication overhead in the distributed hybrid
algorithm is quite low, as can be seen from the
differences of less than 5% between the sequential
version and the distributed hybrid version. This is
due to the separate distributed data structure for
polynomials with asynchronous updates, which avoids
the transport of polynomials as much as possible.
Also the serialization overhead for transport is
minimized by the use of marshalled objects in the
distributed data structure. The scaling obtained in
the corresponding figure (and table 8) also shows
that the implementation, the middle-ware and the
infrastructure perform quite well and are a good
basis for further optimizations regarding selection
strategies and critical pair reductions.

We have not studied the influence of JIT opti- 
mizations in this paper. Our previous measure- 
ments [25] HS1 EH] show time improvements to | 
and to | for second and third runs of the same ex- 



ample in the same JVM. In this paper we used fresh 
JVMs for each run. 

For a discussion of further influencing factors,
such as polynomial and coefficient sizes, we must
refer to [55]. For different selection strategies see
the next sub-section. It remains to study the opti- 
mized Grobner base algorithms jT2J [T3J EH EI] in 
parallel and distributed versions with this respect. 
For further measurements of other algorithms see 

[23 CM US- 

4.3 Workload paradox 

As we have shown above and in [23], the shared
memory parallel implementation scales well for up
to 8 CPUs for a 'regular' problem, but the pure
distributed algorithm scales only to 3-4 nodes
[28]. One reason is the so called 'workload
paradox'. It describes the paradoxical situation
that the parallel and distributed algorithms
sometimes have more work to do than the sequential
algorithm.

The problem has been discussed in [28] for the pure
distributed algorithm. In this paper it can be seen
in figures 3 and 4 for the shared memory parallel
algorithm, in figures 5 and 6 for the pure
distributed algorithm and in figures 8 and 9 for
the hybrid algorithm. We see that the number of
polynomials to be considered varies from 275 to 433
and even to 564 in the worst case (column put). As
a consequence the number of polynomials to be
reduced varies from 1577 to 1717 in the worst case
(column rem). Therefore the speedup achieved with
the parallel and distributed algorithms is limited
in unlucky cases.




Figure 13: Different sequences for pair lcms (left)
and reduced polynomials (right).

4.4 Selection strategies

The main work in the Buchberger algorithm is the
reduction of a polynomial with respect to the list
of polynomials computed so far. Already in the
sequential algorithm there are several
(mathematical) optimizations to avoid the reduction
of polynomials where possible. In the parallel or
distributed algorithm, reduced polynomials are found
in a different order, since some threads or
processes may find polynomials faster than others.
See figure [13] for an example of two such
sequences, represented by the least common
multiples of the head terms of the critical pairs,
and two sequences of head terms of reduced
polynomials. These polynomials are then used to
build new pairs for reduction, and so the sequence
of polynomials for reduction is most likely
different from the sequential case.
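
The least common multiple of two head terms, which labels a critical pair in such a sequence, is the componentwise maximum of the exponent vectors. The following is a minimal sketch under the assumption that a term is stored as a plain `int[]` exponent vector; the class and method names are hypothetical and are not the JAS API.

```java
import java.util.Arrays;

public class HeadTermLcm {

    // Hypothetical representation: the term x1^e1 * ... * xn^en is the vector (e1, ..., en).
    // The lcm of two terms takes the larger exponent in every variable.
    static int[] lcm(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("terms must have the same number of variables");
        }
        int[] r = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = Math.max(a[i], b[i]);
        }
        return r;
    }

    public static void main(String[] args) {
        int[] ht1 = {2, 1, 0}; // x^2 y
        int[] ht2 = {1, 0, 3}; // x z^3
        System.out.println(Arrays.toString(lcm(ht1, ht2))); // prints [2, 1, 3], i.e. x^2 y z^3
    }
}
```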

From this observation it seems best to use the
same order of polynomials and pairs as in the
sequential algorithm, that is, to try to make the
sequence of critical pairs similar to the sequence
in the sequential algorithm [3J. However, since the
selection algorithm is sequential, any such
optimization may reduce the exploitable parallelism
and could also have a negative effect. In pQ, the
authors discuss two other approaches.

We have studied two selection strategies. In both,
n reductions are always performed in parallel. The
first, 'greedy' strategy selects the first finished
result; the second strategy selects the results in
the same sequence in which the reductions were
started. The second strategy is not yet available
for the hybrid algorithm. Although there are
examples where the second strategy is better, we
found the first strategy to perform better and to
be more robust in other examples. Due to space
limitations we are not able to discuss this topic
in more detail; see the references in section 1.2
for an overview of other attempts.
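
The difference between the two strategies can be illustrated with the standard `java.util.concurrent` classes. This is a minimal sketch, not the JAS implementation: the class `SelectionStrategies`, its methods, and the `reduction` stand-in (which only sleeps and returns an id instead of reducing a polynomial) are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SelectionStrategies {

    // Hypothetical stand-in for one reduction: sleeps, then returns its id.
    static Callable<Integer> reduction(int id, long millis) {
        return () -> {
            Thread.sleep(millis);
            return id;
        };
    }

    // 'Greedy' strategy: consume results in the order in which they finish.
    static List<Integer> greedy(List<Callable<Integer>> tasks, int n)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        try {
            CompletionService<Integer> cs = new ExecutorCompletionService<>(pool);
            for (Callable<Integer> t : tasks) {
                cs.submit(t);
            }
            List<Integer> out = new ArrayList<>();
            for (int i = 0; i < tasks.size(); i++) {
                out.add(cs.take().get()); // next finished task, whichever it is
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    // Second strategy: consume results in the order in which the tasks were started.
    static List<Integer> inStartOrder(List<Callable<Integer>> tasks, int n)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (Callable<Integer> t : tasks) {
                futures.add(pool.submit(t));
            }
            List<Integer> out = new ArrayList<>();
            for (Future<Integer> f : futures) {
                out.add(f.get()); // waits for this task even if later ones are done
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Integer>> tasks =
                Arrays.asList(reduction(0, 300), reduction(1, 50), reduction(2, 150));
        System.out.println("greedy:         " + greedy(tasks, 3));
        System.out.println("in start order: " + inStartOrder(tasks, 3));
    }
}
```

With the start-order variant the result list is always in submission order, at the price of waiting for a slow early task. The greedy variant returns whatever finishes first, so its order depends on the run times of the individual tasks.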

5 Conclusions 

We have designed and implemented versions of
parallel and distributed Gröbner bases algorithms.
The distributed hybrid algorithm can use multiple
CPUs on the compute nodes and stores the polynomial
list only once per node. There is only one
communication channel per node between the master
and the reducer thread on the nodes. It is usable
and can give considerable speedup for 'regular'
examples and certain combinations of node numbers
and CPUs per node. The sometimes higher workload in
the parallel and distributed algorithms (the
workload paradox) limits the applicability in
unlucky constellations. We have also shown that the
implementation, the middle-ware and the
infrastructure perform quite well and are a good
basis for further optimizations. The implementation
fits into the designed hierarchy of Gröbner bases
classes; the classes are designed to be type-safe
with Java's generic types and work for all
(implemented exact) fields.

As we have written in [29], future topics to
explore include the study of the run-time behavior
of the algorithm and of optimized variants, the
investigation of different grid middle-wares, the
evaluation of direct InfiniBand communication, and
improved robustness against node failures or bad
reductions.



As mentioned in section 1.2 there are many (also
mathematical) improvements and optimizations for
the sequential Gröbner bases algorithm. These
improvements are hard (possibly impossible) to
carry over to a parallel algorithm, which is a
topic of ongoing research in the area. A possible
parallelization method which has not been studied
up to now works on a higher level. It is known that
the computation of Gröbner bases highly depends on
the chosen term ordering. So a possible algorithm
could start the computation with respect to
several term orderings and use 'good' intermediate
results from each computation. The computation of
comprehensive Gröbner bases could be parallelized
by computing the subtrees on different threads III
ED.

Table 7: Distributed hybrid timings for figure

Table 8: Distributed hybrid timings for figure